Problem Statement
This analysis predicts whether a patient has a high risk of heart disease based on the available variables.
Such a prediction matters to the insurance industry, which wants to know a patient's condition before signing an insurance contract, and to the health care industry, which wants to detect as early as possible whether a patient is at risk of heart disease.
We start by loading all of our packages.

```r
# Load necessary packages
pacman::p_load(tidyverse, lubridate, tidymodels, skimr, GGally, ggstatsplot,
               usemodels, janitor, doParallel, ggthemes, ggthemr, plotly, vip,
               shiny, shinydashboard, DT, caret, jtools, interactions, huxtable,
               Hmisc, broom, glue)
```

| Variable (Feature) Name | Description |
|---|---|
| age | Age of the person |
| sex | Sex of the person (1, 0) |
| cp | Chest pain type (0, 1, 2, 3) |
| trtbps | Resting blood pressure (in mm Hg) |
| chol | Serum cholesterol (in mg/dl) |
| fbs | Fasting blood sugar > 120 mg/dl (1 = true, 0 = false) |
| restecg | Resting electrocardiographic results (0, 1, 2) |
| thalachh | Maximum heart rate achieved |
| exng | Exercise-induced angina (1 = yes, 0 = no) |
| oldpeak | ST depression induced by exercise relative to rest (previous peak) |
| slp | Slope of the peak exercise ST segment (0, 1, 2) |
| caa | Number of major vessels (0-4) |
| thall | Thalassemia (thal) rate (0, 1, 2, 3) |
| output | The risk of heart attack (1 = yes, 0 = no) |
The data was originally sourced from Kaggle.
Importing the data

```r
heart <- read_csv("heart.csv")
```

We start by exploring our data using glimpse and skim.
```r
glimpse(heart)
## Rows: 303
## Columns: 14
## $ age      <dbl> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex      <dbl> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1~
## $ cp       <dbl> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0~
## $ trtbps   <dbl> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol     <dbl> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs      <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ restecg  <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
## $ thalachh <dbl> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exng     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slp      <dbl> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1~
## $ caa      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thall    <dbl> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ output   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
```
```r
skim(heart)
```

| Name | heart |
| Number of rows | 303 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 54.37 | 9.08 | 29 | 47.5 | 55.0 | 61.0 | 77.0 | ▁▆▇▇▁ |
| sex | 0 | 1 | 0.68 | 0.47 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | ▃▁▁▁▇ |
| cp | 0 | 1 | 0.97 | 1.03 | 0 | 0.0 | 1.0 | 2.0 | 3.0 | ▇▃▁▅▁ |
| trtbps | 0 | 1 | 131.62 | 17.54 | 94 | 120.0 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁ |
| chol | 0 | 1 | 246.26 | 51.83 | 126 | 211.0 | 240.0 | 274.5 | 564.0 | ▃▇▂▁▁ |
| fbs | 0 | 1 | 0.15 | 0.36 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▂ |
| restecg | 0 | 1 | 0.53 | 0.53 | 0 | 0.0 | 1.0 | 1.0 | 2.0 | ▇▁▇▁▁ |
| thalachh | 0 | 1 | 149.65 | 22.91 | 71 | 133.5 | 153.0 | 166.0 | 202.0 | ▁▂▅▇▂ |
| exng | 0 | 1 | 0.33 | 0.47 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▃ |
| oldpeak | 0 | 1 | 1.04 | 1.16 | 0 | 0.0 | 0.8 | 1.6 | 6.2 | ▇▂▁▁▁ |
| slp | 0 | 1 | 1.40 | 0.62 | 0 | 1.0 | 1.0 | 2.0 | 2.0 | ▁▁▇▁▇ |
| caa | 0 | 1 | 0.73 | 1.02 | 0 | 0.0 | 0.0 | 1.0 | 4.0 | ▇▃▂▁▁ |
| thall | 0 | 1 | 2.31 | 0.61 | 0 | 2.0 | 2.0 | 3.0 | 3.0 | ▁▁▁▇▆ |
| output | 0 | 1 | 0.54 | 0.50 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | ▇▁▁▁▇ |
```r
heart %>%
  count(output)
```

| output | n |
|---|---|
| 0 | 138 |
| 1 | 165 |
Observation: the outcome classes are fairly balanced (138 vs. 165), so we proceed without resampling. We recode two variables and set the column classes.
```r
heart <- heart %>%
  mutate(thall_2 = thall, caa_2 = caa)
heart$thall_2 <- replace(heart$thall_2, heart$thall_2 < 1, 1)
heart$caa_2  <- replace(heart$caa_2, heart$caa_2 > 2, 2)

# Convert only the categorical columns to factors; converting everything
# to factor and back with as.numeric() would replace the numeric values
# with their factor level indices.
heart <- heart %>%
  mutate(across(-c(age, trtbps, chol, thalachh, oldpeak), as.factor),
         output = forcats::fct_relevel(output, "1"))

# The goals are to check whether the variables now have the correct class
# and whether the recoding of thall and caa worked as intended.
```
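A caution on class conversions: piping a numeric column through `as.factor()` and then `as.numeric()` does not round-trip. It returns the factor's level indices, not the original values. A minimal base-R illustration on a toy vector (not the heart data):

```r
# Toy numeric vector with a repeated value
x <- c(120, 94, 200, 120)

# Converting to factor and straight back to numeric yields level
# indices (positions among the sorted unique values), not the data.
broken <- as.numeric(as.factor(x))
print(broken)  # 2 1 3 2

# The safe round trip goes through the character representation.
restored <- as.numeric(as.character(as.factor(x)))
print(restored)  # 120 94 200 120
```

This is why the conversion above touches only the categorical columns and leaves the five numeric columns alone.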
```r
skim(heart)
```

| Name | heart |
| Number of rows | 303 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| factor | 11 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | 1: 207, 0: 96 |
| cp | 0 | 1 | FALSE | 4 | 0: 143, 2: 87, 1: 50, 3: 23 |
| fbs | 0 | 1 | FALSE | 2 | 0: 258, 1: 45 |
| restecg | 0 | 1 | FALSE | 3 | 1: 152, 0: 147, 2: 4 |
| exng | 0 | 1 | FALSE | 2 | 0: 204, 1: 99 |
| slp | 0 | 1 | FALSE | 3 | 2: 142, 1: 140, 0: 21 |
| caa | 0 | 1 | FALSE | 5 | 0: 175, 1: 65, 2: 38, 3: 20 |
| thall | 0 | 1 | FALSE | 4 | 2: 166, 3: 117, 1: 18, 0: 2 |
| output | 0 | 1 | FALSE | 2 | 1: 165, 0: 138 |
| thall_2 | 0 | 1 | FALSE | 3 | 2: 166, 3: 117, 1: 20 |
| caa_2 | 0 | 1 | FALSE | 3 | 0: 175, 1: 65, 2: 63 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 54.37 | 9.08 | 29 | 47.5 | 55.0 | 61.0 | 77.0 | ▁▆▇▇▁ |
| trtbps | 0 | 1 | 131.62 | 17.54 | 94 | 120.0 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁ |
| chol | 0 | 1 | 246.26 | 51.83 | 126 | 211.0 | 240.0 | 274.5 | 564.0 | ▃▇▂▁▁ |
| thalachh | 0 | 1 | 149.65 | 22.91 | 71 | 133.5 | 153.0 | 166.0 | 202.0 | ▁▂▅▇▂ |
| oldpeak | 0 | 1 | 1.04 | 1.16 | 0 | 0.0 | 0.8 | 1.6 | 6.2 | ▇▂▁▁▁ |
```r
glimpse(heart)
## Rows: 303
## Columns: 16
## $ age      <dbl> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex      <fct> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1~
## $ cp       <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0~
## $ trtbps   <dbl> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol     <dbl> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs      <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ restecg  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
## $ thalachh <dbl> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exng     <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slp      <fct> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1~
## $ caa      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thall    <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ output   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ thall_2  <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ caa_2    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
```
We would like to see the relationships among the numeric variables.

```r
heart %>%
  ggcorrmat(colors = c("red",
                       "white",
                       "green"))
```

The correlation matrix suggests there is no strong correlation among the numeric variables. We can also inspect the pairwise patterns using ggpairs().
```r
ggthemr("fresh")
heart %>%
  select(output, where(is.numeric)) %>%
  ggpairs()
```

The pairs plot shows no unusual relationships among the variables, so we can proceed to the next step. Next, we examine the relationship between each categorical variable and output.
```r
heart %>%
  ggbarstats(x = sex, y = output)
heart %>%
  ggbarstats(x = cp, y = output)
heart %>%
  ggbarstats(x = fbs, y = output)
heart %>%
  ggbarstats(x = restecg, y = output)
heart %>%
  ggbarstats(x = exng, y = output)
heart %>%
  ggbarstats(x = slp, y = output)
heart %>%
  ggbarstats(x = caa, y = output)
heart %>%
  ggbarstats(x = thall, y = output)
```

After taking a first look at the data, we move on to predictive modelling.
```r
set.seed(210725)
heart_split <- heart %>%
  initial_split(prop = .75,
                strata = "output")
class(heart_split)
## [1] "mc_split" "rsplit"
heart_training <- heart_split %>% training()
heart_testing  <- heart_split %>% testing()
```

The split is straightforward: we divide the data into training and testing sets, using stratified sampling on output so that both sets keep a balanced proportion of each class (0 or 1).
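What `strata = "output"` does can be sketched in base R: sample within each outcome class separately, so the class proportions carry over to both partitions. This is a toy sketch of the idea, not the rsample implementation:

```r
set.seed(210725)

# Toy outcome vector with the same 138/165 class mix as the data
output <- rep(c(0, 1), times = c(138, 165))

# Sample 75% of the indices within each class separately
train_idx <- unlist(lapply(split(seq_along(output), output),
                           function(idx) sample(idx, size = floor(0.75 * length(idx)))))

train <- output[train_idx]
test  <- output[-train_idx]

# Class proportions are (nearly) identical in both partitions
round(prop.table(table(train)), 2)
round(prop.table(table(test)), 2)
```

Without stratification, an unlucky split could leave one partition with a noticeably different class balance, which would distort both training and evaluation.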
Feature engineering of the data.
```r
recipe1 <- recipe(output ~ age + sex + cp + thalachh + exng,
                  data = heart_training) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

recipe2 <- recipe(output ~ ., data = heart_training) %>%
  step_rm(caa, thall, restecg, fbs) %>%
  step_normalize(all_numeric_predictors())
```

Recipe 1 is essentially a trial with a handful of intuitive predictors. Recipe 2 is based on the exploratory analysis of the data above.
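step_center() and step_scale() (and step_normalize(), which combines them) learn the training-set mean and standard deviation and then apply them to any new data. The equivalent base-R arithmetic, on a toy vector rather than the heart data, looks like this:

```r
# Toy "training" and "testing" values for a single predictor
train_x <- c(145, 130, 130, 120, 120, 140)
test_x  <- c(150, 125)

# Learn the statistics on the training data only ...
mu    <- mean(train_x)
sigma <- sd(train_x)

# ... then apply the same transformation to both partitions,
# which is what prep()/bake() do under the hood.
train_scaled <- (train_x - mu) / sigma
test_scaled  <- (test_x - mu) / sigma

round(mean(train_scaled), 10)  # 0: centered
round(sd(train_scaled), 10)    # 1: unit variance
```

The key point is that the test data is transformed with the training statistics, never its own, so no information leaks from the test set into preprocessing.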
We choose a random forest (trees = 1000) for all of our models.
```r
RF_model <- rand_forest() %>%
  set_args(mtry = tune(),
           trees = 1000) %>%
  set_engine("ranger",
             importance = "impurity") %>%
  set_mode("classification")
```

Workflows

```r
RF_workflow_1 <- workflows::workflow() %>%
  add_recipe(recipe1) %>%
  add_model(RF_model)
RF_workflow_2 <- workflows::workflow() %>%
  add_recipe(recipe2) %>%
  add_model(RF_model)
```

Cross-Validation

```r
set.seed(210725)
heartcv10 <- vfold_cv(heart_training, 10)
```

Parallel Processing

```r
doParallel::registerDoParallel()
```

Tuning Model 1

```r
set.seed(210725)
heart_tuned_1 <- RF_workflow_1 %>%
  tune_grid(resamples = heartcv10,
            grid = 10)
```

Tuning Model 2

```r
set.seed(210725)
heart_tuned_2 <- RF_workflow_2 %>%
  tune_grid(resamples = heartcv10,
            grid = 10)
```

Collecting Metrics

```r
heart_tuned_results_1 <- heart_tuned_1 %>%
  collect_metrics()
heart_tuned_results_2 <- heart_tuned_2 %>%
  collect_metrics()
```

Selecting the Best Parameters

```r
parameters_heart_tuned_1 <- heart_tuned_1 %>%
  select_best(metric = "roc_auc")
parameters_heart_tuned_2 <- heart_tuned_2 %>%
  select_best(metric = "roc_auc")
```

Finalizing the Workflows

```r
finalized_workflow_heart_1 <- RF_workflow_1 %>%
  finalize_workflow(parameters_heart_tuned_1)
finalized_workflow_heart_2 <- RF_workflow_2 %>%
  finalize_workflow(parameters_heart_tuned_2)
```

Last Fit

```r
fit_heart_1 <- finalized_workflow_heart_1 %>%
  last_fit(heart_split)
fit_heart_2 <- finalized_workflow_heart_2 %>%
  last_fit(heart_split)
```

We assess each model with the metrics roc_auc, accuracy, f_meas, precision, and recall. The models are visualized with a confusion matrix and an ROC curve and compared against a baseline model. Lastly, we also assess the variable importance of each model.
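The classification metrics named above are simple functions of the confusion-matrix counts. A base-R sketch with hypothetical counts (not the model's actual predictions), taking class 1 as the positive class:

```r
# Hypothetical confusion-matrix counts
tp <- 36  # truth 1, predicted 1
fn <- 4   # truth 1, predicted 0
fp <- 9   # truth 0, predicted 1
tn <- 27  # truth 0, predicted 0

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)  # of the predicted positives, how many are real
recall    <- tp / (tp + fn)  # of the real positives, how many are caught
f_meas    <- 2 * precision * recall / (precision + recall)  # harmonic mean

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f_meas = f_meas), 2)
```

For a medical screening problem like this one, recall is arguably the metric to watch: a false negative (an at-risk patient classified as healthy) is costlier than a false positive.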
Collecting Metrics and Predictions

```r
performance_heart_1 <- fit_heart_1 %>% collect_metrics()
predictions_heart_1 <- fit_heart_1 %>% collect_predictions()
performance_heart_2 <- fit_heart_2 %>% collect_metrics()
predictions_heart_2 <- fit_heart_2 %>% collect_predictions()
```

Confusion Matrix

```r
predictions_heart_1 %>%
  conf_mat(truth = output,
           estimate = .pred_class) %>%
  pluck(1) %>%
  as_tibble() %>%
  mutate(cm_colors = ifelse(Truth == 1 & Prediction == 1, "True Positive",
                     ifelse(Truth == 1 & Prediction == 0, "False Negative",
                     ifelse(Truth == 0 & Prediction == 1, "False Positive",
                            "True Negative")))) %>%
  ggplot(aes(x = Prediction, y = Truth)) +
  geom_tile(aes(fill = cm_colors)) +
  scale_fill_manual(values = c("True Positive" = "green3",
                               "True Negative" = "green1",
                               "False Positive" = "tomato3",
                               "False Negative" = "tomato1")) +
  geom_text(aes(label = n), color = "white", size = 10) +
  geom_label(aes(label = cm_colors), vjust = 2) +
  labs(title = "Confusion Matrix Model 1") +
  ggthemes::theme_fivethirtyeight() +
  theme(axis.title = element_text(),
        legend.position = "none")
```

```r
predictions_heart_2 %>%
  conf_mat(truth = output,
           estimate = .pred_class) %>%
  pluck(1) %>%
  as_tibble() %>%
  mutate(cm_colors = ifelse(Truth == 1 & Prediction == 1, "True Positive",
                     ifelse(Truth == 1 & Prediction == 0, "False Negative",
                     ifelse(Truth == 0 & Prediction == 1, "False Positive",
                            "True Negative")))) %>%
  ggplot(aes(x = Prediction, y = Truth)) +
  geom_tile(aes(fill = cm_colors)) +
  scale_fill_manual(values = c("True Positive" = "green3",
                               "True Negative" = "green1",
                               "False Positive" = "tomato3",
                               "False Negative" = "tomato1")) +
  geom_text(aes(label = n), color = "white", size = 10) +
  geom_label(aes(label = cm_colors), vjust = 2) +
  labs(title = "Confusion Matrix Model 2") +
  ggthemes::theme_fivethirtyeight() +
  theme(axis.title = element_text(),
        legend.position = "none")
```

ROC Curve
We create a null model as the baseline for comparison.

```r
baseline_model <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("classification")
baseline_workflow_1 <- workflow() %>%
  add_recipe(recipe1) %>%
  add_model(baseline_model) %>%
  fit_resamples(heartcv10,
                control = control_resamples(save_pred = T))
performance_BASELINE_1 <- baseline_workflow_1 %>% collect_metrics()
predictions_BASELINE_1 <- baseline_workflow_1 %>% collect_predictions()
```

We add an algorithm column to each set of predictions and combine them.
```r
predictions_heart_1 <- predictions_heart_1 %>%
  mutate(algorithm = "RF 1")
predictions_heart_2 <- predictions_heart_2 %>%
  mutate(algorithm = "RF 2")
predictions_BASELINE_1 <- predictions_BASELINE_1 %>%
  mutate(algorithm = "NULL Model")
comparing_predictions_1 <- bind_rows(predictions_heart_1, predictions_heart_2,
                                     predictions_BASELINE_1)
```

We then plot the ROC curve of each model to compare their performance.
```r
comparing_predictions_1 %>%
  group_by(algorithm) %>%
  roc_curve(truth = output,
            .pred_1) %>%
  autoplot() +
  ggthemes::scale_color_fivethirtyeight() +
  labs(title = "Comparison of Predictive Power\nbetween Random Forest & NULL Model\nin Predicting Heart Attack on Patients",
       subtitle = "Random Forest\nPerforms Better in Prediction",
       x = "1 - Specificity (False Positive Rate)",
       y = "Sensitivity (Recall)",
       color = "Prediction Tools") +
  theme(legend.position = c(.65, .25))
```

Model 2 clearly outperforms Model 1, and both random forests beat the null model.
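The area under the ROC curve has a useful probabilistic reading: it is the probability that a randomly chosen positive case gets a higher predicted score than a randomly chosen negative case. A base-R sketch on toy scores (not the model's output):

```r
# Toy predicted probabilities for the positive class, with the truth
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
truth  <- c(1,   1,   0,   1,   0,    1,   0,   0)

# AUC as the fraction of positive/negative pairs ranked correctly
# (ties count half)
pos <- scores[truth == 1]
neg <- scores[truth == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc  # 0.8125: 13 of the 16 pairs are ordered correctly
```

This is why an AUC of 0.5 corresponds to the null model: random scores order a positive above a negative only half the time.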
Metrics

Finding the metrics and comparing both models.

```r
accuracy_1 <- predictions_heart_1 %>%
  metrics(output, .pred_class) %>%
  select(-.estimator) %>%
  filter(.metric == "accuracy") %>%
  rename(accuracy = .estimate)
roc_auc_1 <- performance_heart_1[2, 3] %>%
  rename(roc_auc = .estimate)
Fmeas_1 <- predictions_heart_1 %>%
  f_meas(output, .pred_class) %>%
  select(-.estimator) %>%
  rename(F_Measure = .estimate)
Result_1 <- tibble(accuracy_1[, 2],
                   roc_auc_1[, 1],
                   Fmeas_1[, 2],
                   "precision" = yardstick::precision(predictions_heart_1, output, .pred_class) %>%
                     select(.estimate),
                   "recall" = yardstick::recall(predictions_heart_1, output, .pred_class) %>%
                     select(.estimate)) %>%
  unnest(cols = everything())

accuracy_2 <- predictions_heart_2 %>%
  metrics(output, .pred_class) %>%
  select(-.estimator) %>%
  filter(.metric == "accuracy") %>%
  rename(accuracy = .estimate)
roc_auc_2 <- performance_heart_2[2, 3] %>%
  rename(roc_auc = .estimate)
Fmeas_2 <- predictions_heart_2 %>%
  f_meas(output, .pred_class) %>%
  select(-.estimator) %>%
  rename(F_Measure = .estimate)
Result_2 <- tibble(accuracy_2[, 2],
                   roc_auc_2[, 1],
                   Fmeas_2[, 2],
                   "precision" = yardstick::precision(predictions_heart_2, output, .pred_class) %>%
                     select(.estimate),
                   "recall" = yardstick::recall(predictions_heart_2, output, .pred_class) %>%
                     select(.estimate)) %>%
  unnest(cols = everything())

Result_summary <- bind_rows(round(Result_1, digits = 2), round(Result_2, digits = 2)) %>%
  datatable()
Result_summary
```

Model 2 beats Model 1! This supports the importance of the caa and thall information, which enters Recipe 2 through the recoded caa_2 and thall_2 variables.
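The impurity-based importance reported by ranger below accumulates, over all splits, the decrease in Gini impurity each variable produces. The Gini impurity of a node itself is a one-liner in base R:

```r
# Gini impurity of a node: 1 minus the sum of squared class proportions
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# A node with the data's 138/165 class mix is close to maximally impure ...
parent <- gini(rep(c(0, 1), times = c(138, 165)))

# ... while a pure node has zero impurity.
pure <- gini(rep(1, 100))

round(c(parent = parent, pure = pure), 3)  # 0.496 and 0
```

A split that sends each class mostly to its own child node produces a large impurity decrease, and variables that repeatedly achieve such splits accumulate high importance.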
Variable Importance

```r
finalized_model_1 <- finalized_workflow_heart_1 %>% fit(heart)
model_summary_1 <- pull_workflow_fit(finalized_model_1)$fit
model_summary_1
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~1L, x), num.trees = ~1000, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 303
## Number of independent variables: 5
## Mtry: 1
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1568953
```
```r
feature_importance_1 <- data.frame(importance = model_summary_1$variable.importance) %>%
  rownames_to_column("feature") %>%
  as_tibble() %>%
  mutate(feature = as.factor(feature))
feature_importance_1 %>%
  ggplot(aes(x = importance, y = reorder(feature, importance), fill = importance)) +
  geom_col(show.legend = F) +
  scale_fill_gradient(low = "deepskyblue1", high = "deepskyblue4") +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y = NULL,
       title = "Feature (Variable) Importance Model 1",
       subtitle = "cp is the most important variable") +
  ggthemes::theme_fivethirtyeight()
```

```r
finalized_model_2 <- finalized_workflow_heart_2 %>% fit(heart)
model_summary_2 <- pull_workflow_fit(finalized_model_2)$fit
model_summary_2
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L, x), num.trees = ~1000, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 303
## Number of independent variables: 11
## Mtry: 2
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1251124
```
```r
feature_importance_2 <- data.frame(importance = model_summary_2$variable.importance) %>%
  rownames_to_column("feature") %>%
  as_tibble() %>%
  mutate(feature = as.factor(feature))
feature_importance_2 %>%
  ggplot(aes(x = importance, y = reorder(feature, importance), fill = importance)) +
  geom_col(show.legend = F) +
  scale_fill_gradient(low = "deepskyblue1", high = "deepskyblue4") +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y = NULL,
       title = "Feature (Variable) Importance Model 2",
       subtitle = "cp is still the most important variable") +
  ggthemes::theme_fivethirtyeight()
```

The executive summary is divided into three parts: evidence, interpretation, and recommendations.
Evidence: Model 2 has higher predictive power than Model 1, as can be seen in the metrics, the ROC curve, and the confusion matrix. Models 1 and 2 both beat the null model, as the ROC curve shows. Model 2 reaches 84% accuracy, 95% roc_auc, 87% F-measure, 81% precision, and 93% recall. We found that cp, thall, caa, oldpeak, and thalachh are the most important predictors of heart attack risk.
Interpretation: Chest pain (cp) is the most direct and important predictor of heart attack risk: patients with chest pain of types 1-3 have a high risk of heart attack. The thal rate (thall) is the second most important predictor; type 2 is associated with a higher risk. The number of major vessels (caa) is the third most important predictor; patients with no major vessels (caa = 0) have a higher risk of heart attack.
Recommendations: Using this prediction model, an insurance company can make more accurate predictions of heart attack risk for applicants. At application time, the company should require a medical check-up covering the applicant's chest pain type (cp), thal rate (thall), number of major vessels (caa), previous peak (oldpeak), and maximum heart rate achieved (thalachh). With this predictive model, the insurance company can adjust the premiums of different customers and thus improve its profitability.
For Recipe 1 we simply included a few variables without much deliberation; Recipe 2 was built from a more thorough analysis of the data to achieve the best possible result. The model could be improved further by examining interactions among variables.
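In the recipes framework, interactions could be added with step_interact(). The underlying mechanics of an interaction term, a product column with factors expanded to dummies first, can be sketched with base R's model.matrix(). The data frame below is a toy stand-in for the heart data, and the chosen interaction is illustrative, not validated:

```r
# Toy stand-in for three predictors from the heart data
df <- data.frame(age      = c(63, 37, 41, 56),
                 thalachh = c(150, 187, 172, 178),
                 cp       = factor(c(3, 2, 1, 1)))

# "*" expands to both main effects plus their product term; the factor
# cp becomes dummy columns against its first level.
X <- model.matrix(~ age * thalachh + cp, data = df)
colnames(X)
# "(Intercept)" "age" "thalachh" "cp2" "cp3" "age:thalachh"
```

In a recipe this would correspond to something like step_dummy(all_nominal_predictors()) followed by step_interact(terms = ~ age:thalachh), with the useful interactions selected by cross-validated comparison rather than guesswork.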